{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 25 - Part 2\n", "\n", "First, let's import the necessary libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Are the complaints in zip code 10468 different than in New York City as a whole?\n", "\n", "Zip code 10468 contains Lehman College. We will test whether the distribution of complaints in this zip code is the same as the distribution of complaints in all of New York City.\n", "\n", "Null hypothesis: The complaints in zip code 10468 have the same distribution as complaints made in New York City as a whole.\n", "\n", "Alternative hypothesis: The complaints in zip code 10468 have a different distribution from complaints made in New York City as a whole.\n", "\n", "Test statistic: The Total Variation Distance (TVD) from Lab 14. Recall the TVD is computed between two distributions by taking the absolute difference of the probabilities for each category, summing them, and dividing by 2. \n", "Ex. `np.abs(df[\"Distribution 1\"] - df[\"Distribution 2\"]).sum()/2`\n", "\n", "Load the CSV file with call data from March 3 and 4, 2019 into the dataframe `calls`. Read the `Created Date` column in as a date/time." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the `calls` dataframe to make sure it was loaded into memory correctly. If you want to see all columns, run `pd.set_option(\"display.max_columns\",None)` first." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, get the probabilities of each complaint type in your whole dataframe and store them in the variable `nyc_probs`. \n", "\n", "
Hint:\n", "The function `value_counts()` computes how many of each complaint happened, and adding the parameter `normalize = True` will divide each count by the total number of complaints, giving the probability.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, create a new dataframe of only the calls from zip code 10468 (or a zip code of your choice.) \n", "\n", "
Hint:\n", "Create a filter and then apply it. You will have to look in the dataframe to see what the column containing the zip code is called.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "lehman_filter = calls[\"Incident Zip\"] == 10468\n", "lehman_calls = calls[lehman_filter]\n", "
\n", "\n", "Compute the probabilities of the different complaints in the 10468 zip code." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just looking at the first few probabilities in the two distributions, do you notice any differences?\n", "\n", "We will now perform the hypothesis test to formally check if there is a difference between the distributions. We will create a new dataframe containing the two distributions, and then continue as in the jury panel example in Lab 14. To make a new dataframe called `df` from the probabilities of the NYC complaints, type `df = pd.DataFrame(nyc_probs)` below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the dataframe was created correctly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's add a column to our dataframe `df` containing the 10468 complaint probabilities. Again, display the new dataframe to check that your code worked." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some complaints showed up in the NYC calls, but not the calls from zip 10468. How can you tell which complaints these are in the dataframe?\n", "\n", "Complaints that were in the NYC calls but not the 10468 calls have `NaN` for the probability in the 10468 column. If we wanted to replace `NaN` with a number, what should the number be?\n", "\n", "To replace the NaNs with 0's, type `df = df.fillna(0)` below and run it. Note that you need `df=` to save the changes you made." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the NaNs have been replaced by 0's by displaying `df`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to know if the differences between the 10468 complaint distribution and the NYC complaint distribution are just due to chance or because the distributions are different. To test this, we need to:\n", "\n", "1. Compute the size of the sample of 10468 complaints (to know how large our samples in step 3 should be). \n", "2. Compute the Total Variation Distance (TVD) between the 10468 and NYC complaint distributions.\n", "3. Compute the Total Variation Distances between samples from all complaints and the NYC complaint distribution and make a histogram of them.\n", "4. Compare the TVD from step 2 with the histogram from step 3, and reject or fail to reject the null hypothesis.\n", "\n", "Do step 1: compute the size of the sample of 10468 complaints (to know how large our samples in step 3 should be)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do step 2: compute the Total Variation Distance (TVD) between the 10468 and NYC complaint distributions.\n", "\n", "That is, compute the TVD between the `Complaint Type` and `10468 Complaints` columns in your dataframe `df`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "np.abs(df[\"Complaint Type\"] - df[\"10468 Complaints\"]).sum()/2\n", "
\n", "\n", "We will break step 3 up. First, let's generate one sample from the NYC complaint distribution. Instead of simulating the sample like in the previous hypothesis testing examples, simply take a sample of size 144 (the number of 10468 complaints from step 1) from the `calls` dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the probabilities for the complaints in the sample, and add them to dataframe `df` as a new column. Remember to replace the NaNs with 0's." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "sample_counts = sample[\"Complaint Type\"].value_counts(normalize = True)\n", "df[\"Sample complaints\"] = sample_counts\n", "df = df.fillna(0)\n", "df\n", "
\n", "\n", "Compute the TVD between the sample complaints and all NYC complaints." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "np.abs(df[\"Sample complaints\"] - df[\"Complaint Type\"]).sum()/2\n", "
\n", "\n", "Now we want to repeat these steps (sample from `calls` and compute the TVD between the sample and all NYC complaints) many times." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "tvds = []\n", "for i in range(10000):\n", "    sample = calls.sample(144)\n", "    sample_counts = sample[\"Complaint Type\"].value_counts(normalize = True)\n", "    df[\"Sample complaints\"] = sample_counts\n", "    df = df.fillna(0)\n", "    sample_tvd = np.abs(df[\"Sample complaints\"] - df[\"Complaint Type\"]).sum()/2\n", "    tvds.append(sample_tvd)\n", "
\n", "\n", "Display the histogram of the TVDs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, compare the TVD between the 10468 sample and all NYC complaints with the histogram, and decide whether to reject the null hypothesis or not." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }